 Text Mining



Neurosymbolic Information Extraction from Transactional Documents

Hemmer, Arthur, Coustaty, Mickaël, Bartolo, Nicola, Ogier, Jean-Marc

arXiv.org Artificial Intelligence

This paper presents a neurosymbolic framework for information extraction from documents, evaluated on transactional documents. We introduce a schema-based approach that integrates symbolic validation methods to enable more effective zero-shot extraction and knowledge distillation. The methodology uses language models to generate candidate extractions, which are then filtered through syntactic-, task-, and domain-level validation to ensure adherence to domain-specific arithmetic constraints. Our contributions include a comprehensive schema for transactional documents, relabeled datasets, and an approach for generating high-quality labels for knowledge distillation. Experimental results demonstrate significant improvements in $F_1$-scores and accuracy, highlighting the effectiveness of neurosymbolic validation in transactional document processing.
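The domain-level arithmetic validation described in the abstract can be sketched in a few lines. The field names (`quantity`, `unit_price`, `amount`, `total`) and the tolerance value below are illustrative assumptions, not the paper's actual schema:

```python
# Hypothetical sketch of domain-level arithmetic validation for a
# transactional document; field names are illustrative, not the paper's schema.

def valid_line(line, tol=0.01):
    """Per-line constraint: quantity * unit_price must equal the line amount."""
    return abs(line["quantity"] * line["unit_price"] - line["amount"]) <= tol

def valid_document(doc, tol=0.01):
    """Accept a candidate extraction only if every line item is internally
    consistent and the line amounts sum to the document total."""
    if not all(valid_line(line, tol) for line in doc["lines"]):
        return False
    return abs(sum(line["amount"] for line in doc["lines"]) - doc["total"]) <= tol

candidate = {
    "lines": [
        {"quantity": 2, "unit_price": 9.99, "amount": 19.98},
        {"quantity": 1, "unit_price": 5.00, "amount": 5.00},
    ],
    "total": 24.98,
}
```

A candidate extraction that violates either constraint would be filtered out before it reaches the distillation set, which is how symbolic checks can raise label quality without any extra model training.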


FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich Document Understanding

Agarwal, Amit, Panda, Srikant, Pachauri, Kulbhushan

arXiv.org Artificial Intelligence

In this work, we propose Few Shot Domain Adapting Graph (FS-DAG), a scalable and efficient model architecture for visually rich document understanding (VRDU) in few-shot settings. FS-DAG leverages domain-specific and language/vision-specific backbones within a modular framework to adapt to diverse document types with minimal data. The model is robust to practical challenges such as OCR errors, misspellings, and domain shifts, which are critical in real-world deployments. With fewer than 90M parameters, FS-DAG is highly performant and well suited to complex real-world Information Extraction (IE) applications where computational resources are limited. We demonstrate FS-DAG's capability through extensive experiments on information extraction tasks, showing significant improvements in convergence speed and performance compared to state-of-the-art methods. This work also highlights the ongoing progress in developing smaller, more efficient models that do not compromise on performance. Code: https://github.com/oracle-samples/fs-dag


Reliable End-to-End Material Information Extraction from the Literature with Source-Tracked Multi-Stage Large Language Models

Wang, Xin, Raj, Anshu, Luebbe, Matthew, Wen, Haiming, Xu, Shuozhi, Lu, Kun

arXiv.org Artificial Intelligence

Data-driven materials discovery requires large-scale experimental datasets, yet most of the information remains trapped in unstructured literature. Existing extraction efforts often focus on a limited set of features and have not addressed the integrated composition-processing-microstructure-property relationships essential for understanding materials behavior, thereby posing challenges for building comprehensive databases. To address this gap, we propose a multi-stage information extraction pipeline powered by large language models, which captures 47 features spanning composition, processing, microstructure, and properties exclusively from experimentally reported materials. The pipeline integrates iterative extraction with source tracking to enhance both accuracy and reliability. Evaluations at the feature level (independent attributes) and tuple level (interdependent features) yielded F1 scores around 0.96. Compared with single-pass extraction without source tracking, our approach improved the F1 scores of the microstructure category by 10.0% (feature level) and 13.7% (tuple level), and reduced missed materials from 49 to 13 out of 396 materials in 100 articles on precipitate-containing multi-principal element alloys (miss rate reduced from 12.4% to 3.3%). The pipeline enables scalable and efficient literature mining, producing databases with high precision, minimal omissions, and zero false positives. These datasets provide trustworthy inputs for machine learning and materials informatics, while the modular design generalizes to diverse material classes, enabling comprehensive materials information extraction.
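The distinction between feature-level and tuple-level evaluation mentioned in the abstract can be illustrated with a small sketch. The scoring logic and the example records below are illustrative assumptions, not the paper's evaluation code:

```python
# Illustrative sketch of the two evaluation granularities: feature level
# scores each attribute independently; tuple level counts a prediction
# only if all interdependent features match together.

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def feature_level_counts(gold, pred, keys):
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        for k in keys:
            if p.get(k) is not None and p.get(k) == g.get(k):
                tp += 1          # attribute extracted and correct
            elif p.get(k) is not None:
                fp += 1          # attribute extracted but wrong
            elif g.get(k) is not None:
                fn += 1          # attribute missed entirely
    return tp, fp, fn

def tuple_level_counts(gold, pred, keys):
    tp = sum(all(g.get(k) == p.get(k) for k in keys) for g, p in zip(gold, pred))
    return tp, len(pred) - tp, len(gold) - tp

# Toy records (hypothetical compositions and phases, not the paper's data).
gold = [{"comp": "AlCoCrFeNi", "phase": "FCC"}, {"comp": "CoCrNi", "phase": "FCC"}]
pred = [{"comp": "AlCoCrFeNi", "phase": "BCC"}, {"comp": "CoCrNi", "phase": "FCC"}]
```

On this toy pair, one wrong phase costs a single feature-level point but invalidates the whole tuple, which is why tuple-level F1 is the stricter of the two metrics.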



Benchmarking Agentic Systems in Automated Scientific Information Extraction with ChemX

Vepreva, Anastasia, Razlivina, Julia, Eremeeva, Maria, Gubina, Nina, Orlova, Anastasia, Dmitrenko, Aleksei, Kapranova, Ksenya, Jyakhwo, Susan, Vasilev, Nikita, Sarkisyan, Arsen, Chernyshov, Ivan Yu., Vinogradov, Vladimir, Dmitrenko, Andrei

arXiv.org Artificial Intelligence

The emergence of agent-based systems represents a significant advancement in artificial intelligence, with growing applications in automated data extraction. However, chemical information extraction remains a formidable challenge due to the inherent heterogeneity of chemical data. Current agent-based approaches, both general-purpose and domain-specific, exhibit limited performance in this domain. To address this gap, we present ChemX, a comprehensive collection of 10 manually curated and domain-expert-validated datasets focusing on nanomaterials and small molecules. These datasets are designed to rigorously evaluate and enhance automated extraction methodologies in chemistry. To demonstrate their utility, we conduct an extensive benchmarking study comparing existing state-of-the-art agentic systems such as ChatGPT Agent and chemical-specific data extraction agents. Additionally, we introduce our own single-agent approach that enables precise control over document preprocessing prior to extraction. We further evaluate the performance of modern baselines, such as GPT-5 and GPT-5 Thinking, to compare their capabilities with agentic approaches. Our empirical findings reveal persistent challenges in chemical information extraction, particularly in processing domain-specific terminology, complex tabular and schematic representations, and context-dependent ambiguities. The ChemX benchmark serves as a critical resource for advancing automated information extraction in chemistry, challenging the generalization capabilities of existing methods, and providing valuable insights into effective evaluation strategies.


Combining Constrained and Unconstrained Decoding via Boosting: BoostCD and Its Application to Information Extraction

Šakota, Marija, West, Robert

arXiv.org Artificial Intelligence

Many recent approaches to structured NLP tasks use an autoregressive language model $M$ to map unstructured input text $x$ to output text $y$ representing structured objects (such as tuples, lists, trees, code, etc.), where the desired output structure is enforced via constrained decoding. During training, these approaches do not require the model to be aware of the constraints, which are merely implicit in the training outputs $y$. This is advantageous as it allows for dynamic constraints without requiring retraining, but can lead to low-quality output during constrained decoding at test time. We overcome this problem with Boosted Constrained Decoding (BoostCD), which combines constrained and unconstrained decoding in two phases: Phase 1 decodes from the base model $M$ twice, in constrained and unconstrained mode, obtaining two weak predictions. In phase 2, a learned autoregressive boosted model combines the two weak predictions into one final prediction. The mistakes made by the base model with vs. without constraints tend to be complementary, which the boosted model learns to exploit for improved performance. We demonstrate the power of BoostCD by applying it to closed information extraction. Our model, BoostIE, outperforms prior approaches both in and out of distribution, addressing several common errors identified in those approaches.
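The two-phase structure of BoostCD can be sketched with stub decoders standing in for the autoregressive models. Everything below (the stub outputs, the `--` separator, the naive merge rule) is a hypothetical illustration, not the paper's released code:

```python
# Minimal sketch of the BoostCD two-phase idea with stub decoders.

def constrained_decode(x):
    # Stub for constrained decoding: always well-formed, content may be coarse.
    return ("Marie Curie", "award", "Nobel Prize")

def unconstrained_decode(x):
    # Stub for unconstrained decoding: content tends to be richer,
    # but the structure is not guaranteed.
    return "Marie Curie -- award received -- Nobel Prize in Physics"

def boosted_combine(constrained, unconstrained):
    # Phase 2: the paper trains an autoregressive boosted model here; this
    # naive rule just keeps the constrained structure and prefers the
    # unconstrained surface forms when they parse cleanly.
    subj, rel, obj = constrained
    parts = [p.strip() for p in unconstrained.split("--")]
    if len(parts) == 3:
        subj, rel, obj = (parts[0] or subj, parts[1] or rel, parts[2] or obj)
    return (subj, rel, obj)

def boostcd(x):
    # Phase 1: decode twice from the same base model, then combine.
    return boosted_combine(constrained_decode(x), unconstrained_decode(x))
```

The point of the sketch is the complementarity claim: the constrained pass guarantees a parseable triple, the unconstrained pass supplies better content, and the combiner exploits both.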


RanAT4BIE: Random Adversarial Training for Biomedical Information Extraction

Chen, Jian, Lv, Shengyi, Su, Leilei

arXiv.org Artificial Intelligence

We introduce random adversarial training (RAT), a novel framework successfully applied to biomedical information extraction (BioIE) tasks. While adversarial training yields significant improvements across various performance metrics, it also introduces considerable computational overhead. To address this limitation, we propose RAT as an efficient alternative for biomedical information extraction. Through comprehensive evaluations, RAT demonstrates superior performance compared to baseline models in BioIE tasks. Adversarial training was initially conceptualized as a methodology for enhancing the robustness of deep learning models [1].


MR-UIE: Multi-Perspective Reasoning with Reinforcement Learning for Universal Information Extraction

Li, Zhongqiu, Wang, Shiquan, Fang, Ruiyu, Bao, Mengjiao, Wu, Zhenhe, Song, Shuangyong, Li, Yongxiang, He, Zhongjiang

arXiv.org Artificial Intelligence

Information extraction (IE) is a fundamental task in natural language processing (NLP), encompassing a wide range of subtasks such as Named Entity Recognition (NER), Relation Extraction (RE), and Event Extraction (EE) [1-4]. Traditionally, these tasks have been addressed by specialized models trained on task-specific datasets. However, the fragmentation of tasks and schemas has hindered the development of generalizable and scalable IE systems. To address this limitation, recent research has focused on universal information extraction (UIE), which aims to model all IE tasks within a single framework. A seminal work in this direction, proposed by Lu et al., introduced a structured generation paradigm that encodes diverse IE tasks into a common semantic representation [5]. Building on this, InstructUIE [6] extended the idea with multi-task instruction tuning, enabling models to generalize across tasks via natural language instructions. With the emergence of powerful LLMs [7-11], significant advances have been made across long-standing NLP tasks such as text classification [12-16], intent recognition [17, 18], entity linking [19-22], and beyond. Inspired by their robust performance and adaptability, researchers have explored their potential for information extraction through prompting and in-context learning [23, 24]. For example, CodeIE demonstrated that code generation models can serve as strong few-shot IE extractors when prompted with structured code-like commands [25].


Low-Resource Fine-Tuning for Multi-Task Structured Information Extraction with a Billion-Parameter Instruction-Tuned Model

Chih, Yu Cheng, Hou, Yong Hao

arXiv.org Artificial Intelligence

Deploying large language models (LLMs) for structured data extraction in domains such as financial compliance reporting, legal document analytics, and multilingual knowledge base construction is often impractical for smaller teams due to the high cost of running large architectures and the difficulty of preparing large, high-quality datasets. Most recent instruction-tuning studies focus on seven-billion-parameter or larger models, leaving limited evidence on whether much smaller models can work reliably under low-resource, multi-task conditions. This work presents ETLCH, a billion-parameter LLaMA-based model fine-tuned with low-rank adaptation on only a few hundred to one thousand samples per task for JSON extraction, knowledge graph extraction, and named entity recognition. Despite its small scale, ETLCH outperforms strong baselines across most evaluation metrics, with substantial gains observed even at the lowest data scale. These findings demonstrate that well-tuned small models can deliver stable and accurate structured outputs at a fraction of the computational cost, enabling cost-effective and reliable information extraction pipelines in resource-constrained environments.
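The low-rank adaptation used to fine-tune ETLCH can be illustrated with toy matrices. The dimensions, rank, and scaling factor below are illustrative assumptions, not the paper's actual hyperparameters:

```python
import numpy as np

# Toy illustration of low-rank adaptation (LoRA): instead of updating the
# full weight matrix W, train two small factors A and B and add their
# scaled product. Dimensions and rank are toy values.
rng = np.random.default_rng(0)
d, r = 8, 2                              # hidden size and LoRA rank
W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-init

alpha = 4                                # LoRA scaling hyperparameter
W_adapted = W + (alpha / r) * (B @ A)    # effective weight at inference
```

Because `B` is zero-initialized, the adapted layer starts out identical to the pretrained one, and only the `2 * r * d` factor parameters are trained rather than all `d * d` entries of `W`, which is what makes fine-tuning feasible on a few hundred samples per task.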